Automatic Data Fusion with HumMer
Abstract
Heterogeneous and dirty data is abundant. It is stored under different, often opaque schemata, it represents identical real-world objects multiple times, causing duplicates, and it has missing values and conflicting values. The Humboldt Merger (HumMer) is a tool that allows ad-hoc, declarative fusion of such data using a simple extension to SQL. Guided by a query against multiple tables, HumMer proceeds in three fully automated steps: First, instance-based schema matching bridges schematic heterogeneity of the tables by aligning corresponding attributes. Next, duplicate detection techniques find multiple representations of identical real-world objects. Finally, data fusion and conflict resolution merges duplicates into a single, consistent, and clean representation.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment. Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005

1 Fusing Heterogeneous, Duplicate, and Conflicting Data

The task of fusing data involves the solution of many different problems, each one in itself formidable: Apart from the technical challenges of accessing remote data, heterogeneous schemata of different data sets must be aligned, multiple but differing representations of identical real-world objects (duplicates) must be discovered, and finally the duplicates must be merged to present a clean and consistent result to a user. Each of these tasks has been addressed individually at least to some extent. (i) Access to remote sources is now state of the art of most systems, using techniques such as JDBC, wrappers, Web Services etc.
(ii) Schematic heterogeneity has been a research issue for at least two decades, first in schema integration and then in schema mapping. Recently, schema matching techniques have made great progress in automatically detecting correspondences among elements of different schemata. (iii) Duplicate detection is successful in certain domains, such as address matching, and several research projects have presented domain-independent algorithms. It is usually performed as an individual task, such as the cleansing step in an ETL procedure. (iv) Data fusion, i.e., the step of actually merging multiple tuples into a single representation of a real-world object, has only marginally been dealt with in research and hardly at all in commercial products. With the Humboldt Merger (HumMer) we present a tool that combines all these techniques into a one-stop solution for fusing data from heterogeneous sources. A unique feature of HumMer is that all steps are performed in an ad-hoc fashion at run-time, initiated by a user query to the sources; in a sense, HumMer performs automatic and virtual ETL. Apart from the known advantages of virtual data integration, this on-demand approach allows for maximum flexibility: New sources can be queried immediately, albeit at the price of query results that are less accurate than those of a hand-defined integration process. To compensate, HumMer optionally visualizes each intermediate step of data fusion and allows users to intervene: The result of schema matching can be adjusted, tuples flagged as borderline duplicates can be separated (and, conversely, borderline non-duplicates can be merged), and finally, resolved data conflicts can be undone and resolved manually. Note that these steps are optional: In the usual case, users simply formulate a data fusion query and enjoy the query result.
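To make the three steps concrete, the following is a minimal Python sketch of such a pipeline. It does not reproduce HumMer's algorithms or its SQL extension: the schema matcher is a plain value-overlap (Jaccard) heuristic, duplicate detection is averaged string similarity with a fixed threshold, and conflict resolution is a per-attribute function. All function names, the sample tables, and the threshold of 0.8 are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def match_schemas(rows_a, rows_b):
    """Step 1 (illustrative): map each attribute of B to the attribute of A
    whose set of values overlaps most. Assumes uniform keys per table."""
    def values(rows, attr):
        return {str(r[attr]).lower() for r in rows if r.get(attr) is not None}

    mapping = {}
    for attr_b in rows_b[0]:
        best, best_score = None, 0.0
        for attr_a in rows_a[0]:
            va, vb = values(rows_a, attr_a), values(rows_b, attr_b)
            union = va | vb
            score = len(va & vb) / len(union) if union else 0.0
            if score > best_score:
                best, best_score = attr_a, score
        if best is not None:
            mapping[attr_b] = best
    return mapping  # attribute of B -> attribute of A


def rename(rows, mapping):
    """Rewrite B's tuples into A's attribute names; unmatched attributes are dropped."""
    return [{mapping[k]: v for k, v in r.items() if k in mapping} for r in rows]


def similarity(r1, r2):
    """Average string similarity over attributes that are non-null in both tuples."""
    shared = [k for k in r1 if r1.get(k) is not None and r2.get(k) is not None]
    if not shared:
        return 0.0
    return sum(
        SequenceMatcher(None, str(r1[k]).lower(), str(r2[k]).lower()).ratio()
        for k in shared
    ) / len(shared)


def detect_duplicates(rows, threshold=0.8):
    """Step 2 (illustrative): cluster tuples whose pairwise similarity reaches
    the threshold, using union-find to take the transitive closure."""
    parent = list(range(len(rows)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(rows)), 2):
        if similarity(rows[i], rows[j]) >= threshold:
            parent[find(j)] = find(i)

    clusters = {}
    for i, row in enumerate(rows):
        clusters.setdefault(find(i), []).append(row)
    return list(clusters.values())


def fuse(cluster, resolve):
    """Step 3 (illustrative): merge one cluster into a single tuple. `resolve`
    maps an attribute to a function choosing among the non-null values;
    the default keeps the first non-null value."""
    fused = {}
    for attr in dict.fromkeys(k for r in cluster for k in r):
        candidates = [r[attr] for r in cluster if r.get(attr) is not None]
        pick = resolve.get(attr, lambda vs: vs[0])
        fused[attr] = pick(candidates) if candidates else None
    return fused


def fuse_tables(rows_a, rows_b, resolve, threshold=0.8):
    merged = rows_a + rename(rows_b, match_schemas(rows_a, rows_b))
    return [fuse(c, resolve) for c in detect_duplicates(merged, threshold)]


# Hypothetical CD catalogs with different schemata, a missing value, and a price conflict.
shop_a = [
    {"title": "Abbey Road", "artist": "The Beatles", "price": 12.99},
    {"title": "Kind of Blue", "artist": None, "price": 9.99},
]
shop_b = [
    {"album": "Abbey Road", "band": "The Beatles", "cost": 11.49},
    {"album": "Kind of Blue", "band": "Miles Davis", "cost": 9.99},
    {"album": "Blue Train", "band": "John Coltrane", "cost": 10.50},
]

# Resolve price conflicts by taking the lowest offer.
result = fuse_tables(shop_a, shop_b, resolve={"price": min})
for row in result:
    print(row)
```

In this sketch the value-overlap matcher only finds a correspondence when the two tables share at least one identical value per attribute, and it does not enforce a one-to-one mapping, so it should be read as a stand-in for a real instance-based matcher. The `resolve` dictionary plays the role of the user-adjustable conflict resolution described above.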
Ad-hoc and automatic data fusion is useful in many scenarios: Catalog integration is a typical one-time problem for companies that have merged, but it is also of interest for shopping agents collecting data about identical products offered at different sites. A customer shopping for CDs might want to supply only